The maximum relative entropy principle 

We show that the naive application of the maximum entropy principle can yield answers which depend on the level of description, i.e. the result is not invariant under coarse-graining. We demonstrate that the correct approach, even for discrete systems, requires maximization of the relative entropy with a suitable reference probability, which in some instances can be deduced from the symmetry properties of the dynamics. We present simple illustrations of this crucial yet surprising feature in examples of classical and quantum statistical mechanics, as well as in the field of ecology.

There are numerous situations in the natural and social sciences, medicine, and business which can be described at different levels of detail in terms of probability distributions. Such descriptions arise either intrinsically, as in quantum mechanics, or because of the vast amount of detail necessary for a complete description, as, for example, in Brownian motion and in many-body systems. Variational methods can be used for constructing an estimate of the underlying probability distribution when one does not have sufficient information to deduce it exactly. The principle of maximum entropy [1, 2, 3, 4, 5, 6, 7, 8, 9] is a widely used variational method for the analysis of both complex equilibrium and non-equilibrium systems and is being increasingly employed in a variety of contexts such as image reconstruction [10, 11], crystallography [12], the earth sciences [13], ecology [14], protein science [15], nuclear magnetic resonance spectroscopy [16], X-ray diffraction [17], plasma physics [18], condensed matter physics [19], planetary studies [20], electron microscopy [21], and neuroscience [22].

The maximum entropy principle: Consider the maximization of the entropy [2]

$$H(P) = -\sum_{C} P(C) \ln P(C), \tag{1}$$

where $P(C)$ is the probability that a certain event $C$ occurs, subject to the constraints

$$\langle Q_r \rangle = \sum_{C} P(C)\, Q_r(C) = \bar{Q}_r, \qquad r = 0, 1, 2, \ldots, \tag{2}$$

where the $r$-th constraint requires that the mean value of a quantity $Q_r$ is equal to $\bar{Q}_r$. The case $r = 0$ is a normalization condition that ensures that $\sum_C P(C) = 1$ and results from selecting $Q_0 = \bar{Q}_0 = 1$. For $r \geq 1$, $\bar{Q}_r$ is obtained from the partial knowledge one has about the system.
The logic underlying the variational method follows from the link between information and entropy: the more information one has, the lower the entropy. The entropy is reduced as a result of the partial knowledge encoded in $\bar{Q}_r$. The entropy maximization principle arises from the observation that the entropy must be the highest possible that includes the available information, because a lower entropy would imply that more information has been incorporated than is available. Using Lagrange multipliers, $\lambda_r$, to impose the constraints, one seeks to maximize

$$-\sum_{C} P(C) \ln P(C) - \sum_{r} \lambda_r \sum_{C} P(C)\, Q_r(C). \tag{3}$$

The general solution is found to be

$$P(C) = e^{-\sum_{r} \lambda_r Q_r(C)}, \tag{4}$$

where the constant $-1$ arising from the variation of $P \ln P$ has been absorbed into $\lambda_0$, and the $\lambda$'s have to be determined in order to satisfy the constraints (2). In order to illustrate the potential problems associated with a naive application of the maximum entropy principle, we begin with a simple classical example.
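
As a rough illustration of how the Lagrange multipliers are fixed in practice, the following sketch treats a single mean constraint on a six-sided die. The die, the target mean of 4.5, and the function name `maxent_distribution` are illustrative assumptions and do not come from the text; the multiplier is found by simple bisection.

```python
import math

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximum-entropy distribution on a finite set with one mean constraint.

    The solution has the exponential form P(C) proportional to exp(-lam * Q(C));
    lam is found by bisection so that sum_C P(C) Q(C) equals target_mean.
    """
    def distribution(lam):
        # Shift the exponent by its maximum for numerical stability.
        shift = max(-lam * q for q in values)
        weights = [math.exp(-lam * q - shift) for q in values]
        z = sum(weights)  # normalization, i.e. the role played by lambda_0
        return [w / z for w in weights]

    def mean(lam):
        return sum(p * q for p, q in zip(distribution(lam), values))

    # The constrained mean decreases monotonically as lam increases.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, distribution(lam)

# Hypothetical example: a die whose average roll is known to be 4.5.
faces = [1, 2, 3, 4, 5, 6]
lam, probs = maxent_distribution(faces, target_mean=4.5)
```

With a target mean above 3.5 the multiplier comes out negative, so the probabilities increase with the face value; a target of exactly 3.5 recovers the uniform distribution.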

Application of the maximum entropy principle to classical statistical mechanics - derivation of Boltzmann statistics: Consider a canonical ensemble of $N$ non-interacting particles that can occupy discrete, non-degenerate energy levels (the extension to the degenerate case is straightforward but results in more cumbersome equations) having an energy $e_j$, with $j = 1, 2, 3, \ldots$ We impose an energy constraint, $\bar{Q}_1 = \sum_j \langle n_j \rangle e_j$, where $n_j$ is the number of particles in level $j$ and $\langle \cdot \rangle$ denotes the average value. The temperature is a measure of the average total energy. The goal is to determine the probability, $P$, of observing a given distribution of particles among the levels subject to a constraint on the average total energy. For the classical case, the particles are distinguishable, the identities of the particles are known and the energy level occupied by the $a$-th particle can be indicated by $i_a$, $a = 1, \ldots, N$. A straightforward application of the maximum entropy principle yields the well-known Boltzmann result for the probability of observing the particle configuration $\mathbf{i}$,

$$P_B(\mathbf{i}) \propto e^{-\beta \sum_{a=1}^{N} e_{i_a}}, \tag{5}$$

where $\mathbf{i}$ denotes the event in which particle 1 is in level $i_1$, particle 2 is in level $i_2$, and so on, the Lagrange parameter $\beta$ is proportional to the inverse temperature and the subscript $B$ stands for Boltzmann. In a coarse-grained description [23], in which one keeps track of just the number of particles in each level (the occupation number representation), the relevant event is $\mathbf{n}$, where $n_1$ is the number of particles in level 1, $n_2$ is the number of particles in level 2, and so on, without regard to their identity. Within the $\mathbf{n}$ description and starting from Eq. (5), one obtains for the probability, $P'_B(\mathbf{n})$, of the event

$$P'_B(\mathbf{n}) \propto \frac{1}{\prod_j n_j!}\, e^{-\beta \sum_j n_j e_j}, \tag{6}$$

where the prime superscript denotes that the result has been obtained on coarse-graining.
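
The coarse-graining step can be checked by brute force on a small system. The sketch below enumerates every labelled configuration, sums the Boltzmann weights that share the same occupation numbers, and compares the result with the multinomial closed form. The three level energies, $N = 4$, the value of $\beta$, and the helper names are illustrative assumptions.

```python
import math
from itertools import product
from collections import defaultdict

def boltzmann_fine(energies, n_particles, beta):
    """Unnormalized Boltzmann weights over labelled configurations i."""
    levels = range(len(energies))
    return {
        i: math.exp(-beta * sum(energies[level] for level in i))
        for i in product(levels, repeat=n_particles)
    }

def coarse_grain(weights_i, n_levels):
    """Sum fine-grained weights over all i with the same occupation numbers n."""
    weights_n = defaultdict(float)
    for i, w in weights_i.items():
        n = tuple(i.count(level) for level in range(n_levels))
        weights_n[n] += w
    return dict(weights_n)

energies = [0.0, 1.0, 2.5]  # illustrative level energies (assumption)
N, beta = 4, 0.7            # illustrative particle number and beta (assumption)
coarse = coarse_grain(boltzmann_fine(energies, N, beta), len(energies))

# Closed form: number of labelled configurations per n times the Boltzmann factor.
closed = {
    n: math.factorial(N) / math.prod(math.factorial(k) for k in n)
    * math.exp(-beta * sum(k * e for k, e in zip(n, energies)))
    for n in coarse
}
```

The constant factor $N!$ is the same for every $\mathbf{n}$, so it drops out of the proportionality and only the $1/\prod_j n_j!$ dependence survives.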

Quantum statistics and a puzzle: The surprising feature of the maximum entropy principle is that its direct application to the quantity $P(\mathbf{n})$ yields a result different from Eq. (6). (The constraints imposed on the system in terms of $\mathbf{i}$ are also expressible in terms of the coarse-grained event description $\mathbf{n}$.) One obtains instead the celebrated Bose-Einstein distribution:

$$P_{BE}(\mathbf{n}) \propto e^{-\beta \sum_j n_j e_j}, \qquad n_j = 0, 1, 2, \ldots \tag{7}$$

On constraining $n_j = 0, 1$ for all $j$ one gets the Fermi-Dirac statistics. This result is pleasing because the $\mathbf{n}$ representation is in fact the appropriate one for deriving quantum statistics. The particles are indistinguishable and all the information that one has is encapsulated by $P(\mathbf{n})$. The conundrum is that the result obtained by applying the maximum entropy principle to $P(\mathbf{i})$ and then coarse-graining to obtain $P(\mathbf{n})$ is different from the result obtained by applying the maximum entropy principle directly to $P(\mathbf{n})$. In other words, the operations of maximizing the entropy and of coarse-graining do not commute.
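
The failure to commute can be seen numerically in a toy case. The sketch below assumes two particles, two levels with energies 0 and 1, and an imposed mean energy of 0.6 (all illustrative choices, with the particle number held fixed for simplicity). It maximizes the ordinary entropy once over labelled configurations and once directly over occupation numbers, using the same constraint.

```python
import math
from itertools import product

E_LEVELS = (0.0, 1.0)  # two illustrative energy levels (assumption)
N = 2                  # number of particles (assumption)
TARGET = 0.6           # imposed average total energy (assumption)

def maxent(states, energy, target):
    """Maximum-entropy distribution over `states` with <energy> = target."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        beta = 0.5 * (lo + hi)
        w = [math.exp(-beta * energy(s)) for s in states]
        mean = sum(wi * energy(s) for wi, s in zip(w, states)) / sum(w)
        if mean > target:
            lo = beta  # mean energy too high: increase beta
        else:
            hi = beta
    z = sum(w)
    return {s: wi / z for s, wi in zip(states, w)}

# Route A: maximize over labelled configurations i, then coarse-grain to n.
fine_states = list(product(range(len(E_LEVELS)), repeat=N))
p_fine = maxent(fine_states, lambda i: sum(E_LEVELS[l] for l in i), TARGET)
fine_then_coarse = {}
for i, p in p_fine.items():
    n = (i.count(0), i.count(1))
    fine_then_coarse[n] = fine_then_coarse.get(n, 0.0) + p

# Route B: maximize directly over occupation numbers n.
coarse_states = [(N - k, k) for k in range(N + 1)]
p_direct = maxent(
    coarse_states, lambda n: sum(k * e for k, e in zip(n, E_LEVELS)), TARGET
)
```

Both routes reproduce the same mean energy, yet they assign different probabilities to the mixed state $(1, 1)$: route A counts its two labelled realizations, route B does not.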

Relative entropy and resolution of the puzzle: We suggest that the correct and consistent application of the maximum entropy principle entails the maximization of the relative entropy [24] instead of the Shannon entropy in Eq. (1), subject again to the constraints obtained from the partial knowledge that one has about the system. The relative entropy of $P$ with respect to $P_0$ is defined as

$$H(P|P_0) = -\sum_{C} P(C) \ln \frac{P(C)}{P_0(C)}, \tag{8}$$

where the new term in the denominator, $P_0(C)$, is a reference term. Such a reference term has been discussed in the literature in the different context of going from a discrete to a continuous system and is "proportional to the limiting density of discrete points" [7], where it is needed for dimensional reasons. The reference term is, however, not commonly invoked as an essential ingredient in the discrete case. It has been shown by Shore and Johnson that "given a continuous prior density and new constraints, there is only one posterior density satisfying these constraints that can be chosen by a procedure that satisfies the axioms". The unique posterior can be obtained by maximizing the relative entropy, and the axioms pertain to uniqueness, invariance, system independence and subset independence. If $P_0(C)$ can be chosen to be a constant or simply equal to 1, Eq. (8) becomes equivalent to Eq. (1). Due to the convexity of the function $x \ln x$, the relative entropy is never positive and it reaches its maximum value of zero when $P = P_0$. In the absence of any constraint, the maximization of the relative entropy leads to the result $P(C) = P_0(C)$.
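
The stated properties of the relative entropy are easy to check on a small example. In the sketch below the three-event distributions and the function names are illustrative assumptions; the reference `p0` is taken to be normalized, which is what the non-positivity property relies on.

```python
import math

def relative_entropy(p, p0):
    """H(P|P0) = -sum_C P(C) ln(P(C)/P0(C)), with 0 ln 0 taken as 0."""
    return -sum(pc * math.log(pc / p0[c]) for c, pc in p.items() if pc > 0)

def shannon_entropy(p):
    """H(P) = -sum_C P(C) ln P(C)."""
    return -sum(pc * math.log(pc) for pc in p.values() if pc > 0)

# Hypothetical three-event example.
p0 = {"a": 0.5, "b": 0.3, "c": 0.2}  # normalized reference probability
p = {"a": 0.2, "b": 0.2, "c": 0.6}   # some other normalized distribution
ones = {c: 1.0 for c in p}           # trivial reference, P0(C) = 1

h_self = relative_entropy(p0, p0)    # maximum value, zero
h_other = relative_entropy(p, p0)    # strictly negative
h_trivial = relative_entropy(p, ones)  # reduces to the ordinary entropy
```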

If the space of events is coarse grained, i.e. it is partitioned into subsets $C'$, which are pairwise disjoint, representing collections of events $C$, then the relative entropy is given by

$$H(P'|P'_0) = -\sum_{C'} P'(C') \ln \frac{P'(C')}{P'_0(C')}, \tag{9}$$

where the reference term $P'_0(C')$ is obtained straightforwardly by coarse-graining $P_0(C)$ as

$$P'_0(C') = \sum_{C \in C'} P_0(C). \tag{10}$$



If the constraints Eq. (2) are functions only of $C'$ then the relative entropy maximization commutes with the operation of coarse graining and one obtains

$$P'(C') = \sum_{C \in C'} P(C). \tag{11}$$



In the derivation of Eq. (5), it was implicitly assumed that $P_{0,B}(\mathbf{i}) = 1$. On coarse-graining to a description involving the variable $\mathbf{n}$, Eq. (10) leads to $P'_{0,B}(\mathbf{n}) = N!/\prod_j n_j!$, yielding once again the standard Boltzmann distribution, Eq. (6). If, instead, one assumes that $P_{0,BE}(\mathbf{n}) = 1$ then one derives the Bose-Einstein distribution, Eq. (7).
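
A small numerical check of this consistency is sketched below, assuming three particles, three levels and a target mean energy chosen purely for illustration. The relative entropy is maximized once at the fine level with a uniform reference and once at the coarse level with the coarse-grained reference, and the two answers are compared. The function `max_relative_entropy` and all parameter values are hypothetical.

```python
import math
from itertools import product
from collections import Counter

def max_relative_entropy(states, reference, energy, target):
    """P(s) proportional to reference[s] * exp(-beta * energy(s)), <energy> = target."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        beta = 0.5 * (lo + hi)
        w = {s: reference[s] * math.exp(-beta * energy(s)) for s in states}
        z = sum(w.values())
        mean = sum(w[s] * energy(s) for s in states) / z
        if mean > target:
            lo = beta  # mean energy too high: increase beta
        else:
            hi = beta
    return {s: w[s] / z for s in states}

LEVELS = (0.0, 1.0, 3.0)  # illustrative level energies (assumption)
N, TARGET = 3, 2.0        # illustrative particle number and mean energy

def occupation(i):
    return tuple(i.count(level) for level in range(len(LEVELS)))

def energy_i(i):
    return sum(LEVELS[level] for level in i)

def energy_n(n):
    return sum(k * e for k, e in zip(n, LEVELS))

# Fine level: uniform reference over labelled configurations.
fine_states = list(product(range(len(LEVELS)), repeat=N))
ref_fine = {i: 1.0 for i in fine_states}

# Coarse-grained reference: sum of the fine reference over each subset.
ref_coarse = Counter()
for i in fine_states:
    ref_coarse[occupation(i)] += ref_fine[i]
coarse_states = list(ref_coarse)

p_fine = max_relative_entropy(fine_states, ref_fine, energy_i, TARGET)
fine_then_coarse = Counter()
for i, p in p_fine.items():
    fine_then_coarse[occupation(i)] += p

p_coarse = max_relative_entropy(coarse_states, ref_coarse, energy_n, TARGET)
```

Here `ref_coarse` reproduces the multinomial factor $N!/\prod_j n_j!$, and `p_coarse` agrees with `fine_then_coarse` to numerical precision, in contrast with what happens when the reference term is omitted at the coarse level.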

Role of system dynamics: The success of the principle of maximum entropy hinges on the choice of the reference probability, $P_0$, and the identification of the correct constraints not encapsulated in $P_0$. In the statistical mechanics examples studied above, the constraint is imposed by fixing the average energy while the choice of $P_0$ is guided by the postulate that all states are a priori equally probable when one works at the finest level of description for the system being studied. Of course, this follows from the dynamics of the system.

Consider the dynamics, in terms of a Markov process, in the occupation number representation. If the transition rate $W^{quantum}(n_j \to n_j + 1)$ ($W^{quantum}(n_j \to n_j - 1)$) is proportional to $n_j + 1$ ($n_j$) then, in the stationary state, $P_{0,BE}(\mathbf{n}) =$ constant, in agreement with the implicit choice made for the Bose-Einstein case, Eq. (7). These transition rates follow from the symmetry of the quantum wave function describing indistinguishable particles [25]. For classical (distinguishable) particles, the transition rate $W^{classical}(n_j \to n_j + 1)$ is simply constant whereas the transition rate $W^{classical}(n_j \to n_j - 1)$ is proportional to $n_j$. In the stationary state, $P'_{0,B}(\mathbf{n})$ is proportional to $1/\prod_j n_j!$ which, when used as the reference probability, correctly leads to Eq. (6). At the description level $\mathbf{i}$, this is equivalent to $P_{0,B}(\mathbf{i}) =$ constant.
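
The link between transition rates and the reference probability can be sketched for a single level treated as a one-step birth-death chain, assuming detailed balance holds in the stationary state. The function name `stationary_weights`, the truncation at `n_max`, and the unit proportionality constants are illustrative assumptions.

```python
import math

def stationary_weights(gain, loss, n_max):
    """Unnormalized stationary weights of a one-step chain on n = 0..n_max.

    Detailed balance, w(n) * gain(n) = w(n + 1) * loss(n + 1), is applied
    recursively starting from w(0) = 1.
    """
    weights = [1.0]
    for n in range(n_max):
        weights.append(weights[-1] * gain(n) / loss(n + 1))
    return weights

# Quantum-like rates: gain proportional to n + 1, loss proportional to n.
quantum = stationary_weights(lambda n: n + 1.0, lambda n: float(n), n_max=8)

# Classical rates: constant gain, loss proportional to n.
classical = stationary_weights(lambda n: 1.0, lambda n: float(n), n_max=8)
```

The first choice gives a flat set of weights, matching the constant reference of the Bose-Einstein case, while the second gives weights equal to $1/n!$, matching the classical reference.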



An ecology application: A fundamental quantity in ecology is the probability distribution of the species abundance, i.e. the probability, $P_{ECO}(\mathbf{n})$, that the first species has a population $n_1$, the second species $n_2$, and so on. As an illustration, consider the simple symmetric case in which all species are demographically equivalent [26] and are governed by similar death and birth rates. The direct application of the maximum entropy principle without the appropriate non-trivial reference term and with the constraint that the average population, $\langle \sum_j n_j \rangle$, is fixed yields a simple exponential form for the species abundance

$$P_{ECO}(\mathbf{n}) \propto \prod_j e^{-\beta n_j}. \tag{12}$$

The relative species abundance (RSA), $P_{RSA}(n) = \langle \delta_{n, n_k} \rangle$, the probability that the $k$-th species has population $n$, is thus proportional to $\exp(-\beta n)$.

In order to choose the reference probability, we turn to the dynamics as a guide. Consider a Markov process with transition rates $W^{ECO}(n_j \to n_j \pm 1) = n_j + c$, where $c$ is a constant term that, for simplicity, is species independent. When $c = 0$, one has a simple birth-death process, whose rate is proportional to the number of individuals of a given species. A non-zero value of $c$ introduces density dependence in the birth and death rates, with a positive value of $c$ corresponding to a rare-species advantage [27]. The stationary state corresponding to these dynamics provides a measure of the reference probability $P_{0,ECO}(\mathbf{n}) \propto \prod_j 1/(n_j + c)$. On applying the principle of maximum relative entropy with this reference probability, we find

$$P_{ECO}(\mathbf{n}) \propto \prod_j \frac{e^{-\beta n_j}}{n_j + c} \tag{13}$$

instead of Eq. (12). This leads to $P_{RSA}(n) \propto \exp(-\beta n)/(n + c)$. When $c = 0$ we obtain the celebrated Fisher log series [28]. This result can also be obtained from the standard application of the principle of maximum entropy by imposing a constraint on the average value of $\ln n$, a constraint with no ecological basis. When $c$ is positive, one obtains the result derived using a density dependent neutral approach [29] which fits the RSA data of several tropical forests fairly well. The Fisher log-series has a simple physical interpretation: the $e^{-\beta n}$ term results from the constraint on the average population whereas the $1/n$ factor follows from the dynamics. The characteristic time scale of a birth or death event is inversely proportional to $n$, the number of individuals in a given species, since each individual is a candidate for dying or for giving birth. The $c$ correction arises straightforwardly from density dependence in the birth and/or death rates.
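
The two ingredients of the ecological result can be sketched numerically, assuming detailed balance for a single species with population $n \geq 1$. The first function recovers the $1/(n + c)$ reference from the rates $n + c$; the second builds the resulting abundance distribution and, for $c = 0$, compares its normalization with that of the log series. The values of $\beta$, $c$ and the cutoff `n_max` are illustrative assumptions.

```python
import math

def eco_reference(c, n_max):
    """Unnormalized stationary weights for rates W(n -> n +/- 1) = n + c, n >= 1.

    Detailed balance gives w(n + 1) = w(n) * (n + c) / (n + 1 + c), with w(1) = 1.
    """
    weights = [1.0]
    for n in range(1, n_max):
        weights.append(weights[-1] * (n + c) / (n + 1 + c))
    return weights

def rsa(beta, c, n_max):
    """Normalized abundance, P(n) proportional to exp(-beta * n) / (n + c), n = 1..n_max."""
    weights = [math.exp(-beta * n) / (n + c) for n in range(1, n_max + 1)]
    z = sum(weights)
    return [w / z for w in weights]

beta, n_max = 0.05, 2000  # illustrative values; the cutoff makes the tail negligible
x = math.exp(-beta)

reference = eco_reference(c=0.5, n_max=50)
fisher = rsa(beta, c=0.0, n_max=n_max)

# For c = 0 the weights are x**n / n, whose sum over n >= 1 is -ln(1 - x).
log_series_norm = -math.log(1.0 - x)
```

The list `reference` is proportional to $1/(n + c)$, and `fisher[0]` agrees with $x / [-\ln(1 - x)]$, the first term of the normalized log series.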

Summary: The maximum entropy principle is an inference technique for constructing an estimate of a probability distribution using available information. We suggest that, in order to guarantee that the results do not depend on the description level, one ought to maximize the relative entropy subject to the known constraints. This provides a natural interpretation of the relative entropy [24] in the context of statistical mechanics. In order to be successful, the method requires knowledge of the reference probability, which, in turn, depends on the system dynamics. Alternatively, one could maximize the ordinary entropy $H(P)$, Eq. (1), and continue to add additional constraints until one obtains the correct $P$. In order to obtain the correct answer, in the absence of the reference probability, one requires the knowledge of which optimal constraints to use (e.g. the constraint on the average value of $\ln n$ in the ecology illustration) or the use of a large enough number of constraints [6] to ensure convergence. Unfortunately, in general, there is no a priori guarantee that either of these approaches will be successful.